Buying and selling used phones and tablets used to happen on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and an IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7 billion by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses looking to save money on a purchase, and the used device market has plenty of other benefits. Used and refurbished devices can be sold with warranties and insured with proof of purchase. Third-party vendors and platforms, such as Verizon and Amazon, offer attractive deals on refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact, supports recycling, and cuts waste. The impact of the COVID-19 outbreak may further boost this segment, as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
Data Dictionary
# automatically formats the Python code in this notebook (good coding practice)
%load_ext nb_black
import pandas as pd
import numpy as np
# for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns
# For randomized data splitting
from sklearn.model_selection import train_test_split
# To build a linear regression model
import statsmodels.api as sm
# To check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
%matplotlib inline
df = pd.read_csv("used_device_data.csv")
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.") # f-string
np.random.seed(1)
df.sample(n=10)
There are 3454 rows and 15 columns.
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 866 | Others | Android | 15.24 | no | no | 8.00 | 2.0 | 16.0 | 4.00 | 3000.0 | 206.0 | 2014 | 632 | 4.038832 | 5.190788 |
| 957 | Celkon | Android | 10.16 | no | no | 3.15 | 0.3 | 512.0 | 0.25 | 1400.0 | 140.0 | 2013 | 637 | 2.800325 | 3.884652 |
| 280 | Infinix | Android | 15.39 | yes | no | NaN | 8.0 | 32.0 | 2.00 | 5000.0 | 185.0 | 2020 | 329 | 4.370713 | 4.487287 |
| 2150 | Oppo | Android | 12.83 | yes | no | 13.00 | 16.0 | 64.0 | 4.00 | 3200.0 | 148.0 | 2017 | 648 | 4.677863 | 5.639422 |
| 93 | LG | Android | 15.29 | yes | no | 13.00 | 5.0 | 32.0 | 3.00 | 3500.0 | 179.0 | 2019 | 216 | 4.517650 | 5.300415 |
| 1040 | Gionee | Android | 12.83 | yes | no | 13.00 | 8.0 | 32.0 | 4.00 | 3150.0 | 166.0 | 2016 | 970 | 4.645640 | 5.634325 |
| 3170 | ZTE | Others | 10.16 | no | no | 3.15 | 5.0 | 16.0 | 4.00 | 1400.0 | 125.0 | 2014 | 1007 | 3.764451 | 4.244344 |
| 2742 | Sony | Android | 12.70 | yes | no | 20.70 | 2.0 | 16.0 | 4.00 | 3000.0 | 170.0 | 2013 | 1060 | 4.422809 | 5.799820 |
| 102 | Meizu | Android | 15.29 | yes | no | NaN | 20.0 | 128.0 | 6.00 | 3600.0 | 165.0 | 2019 | 332 | 4.959412 | 6.040659 |
| 1195 | HTC | Android | 10.29 | no | no | 8.00 | 2.0 | 32.0 | 4.00 | 2000.0 | 146.0 | 2015 | 892 | 4.227855 | 4.879007 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   object
 1   os                     3454 non-null   object
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object
 4   5g                     3454 non-null   object
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64
 12  days_used              3454 non-null   int64
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
# looking at which columns have the most missing values
df.isnull().sum().sort_values(ascending=False)
main_camera_mp           179
weight                     7
battery                    6
int_memory                 4
ram                        4
selfie_camera_mp           2
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
# Let's look at the statistical summary of the data
df.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3454 | 4 | Android | 3214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3454.0 | NaN | NaN | NaN | 13.713115 | 3.80528 | 5.08 | 12.7 | 12.83 | 15.34 | 30.71 |
| 4g | 3454 | 2 | yes | 2335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3454 | 2 | no | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3275.0 | NaN | NaN | NaN | 9.460208 | 4.815461 | 0.08 | 5.0 | 8.0 | 13.0 | 48.0 |
| selfie_camera_mp | 3452.0 | NaN | NaN | NaN | 6.554229 | 6.970372 | 0.0 | 2.0 | 5.0 | 8.0 | 32.0 |
| int_memory | 3450.0 | NaN | NaN | NaN | 54.573099 | 84.972371 | 0.01 | 16.0 | 32.0 | 64.0 | 1024.0 |
| ram | 3450.0 | NaN | NaN | NaN | 4.036122 | 1.365105 | 0.02 | 4.0 | 4.0 | 4.0 | 12.0 |
| battery | 3448.0 | NaN | NaN | NaN | 3133.402697 | 1299.682844 | 500.0 | 2100.0 | 3000.0 | 4000.0 | 9720.0 |
| weight | 3447.0 | NaN | NaN | NaN | 182.751871 | 88.413228 | 69.0 | 142.0 | 160.0 | 185.0 | 855.0 |
| release_year | 3454.0 | NaN | NaN | NaN | 2015.965258 | 2.298455 | 2013.0 | 2014.0 | 2015.5 | 2018.0 | 2020.0 |
| days_used | 3454.0 | NaN | NaN | NaN | 674.869716 | 248.580166 | 91.0 | 533.5 | 690.5 | 868.75 | 1094.0 |
| normalized_used_price | 3454.0 | NaN | NaN | NaN | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.7557 | 6.619433 |
| normalized_new_price | 3454.0 | NaN | NaN | NaN | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
# check column medians (numeric columns only)
df.median(numeric_only=True)
screen_size                12.830000
main_camera_mp              8.000000
selfie_camera_mp            5.000000
int_memory                 32.000000
ram                         4.000000
battery                  3000.000000
weight                    160.000000
release_year             2015.500000
days_used                 690.500000
normalized_used_price       4.405133
normalized_new_price        5.245892
dtype: float64
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
plt.figure(figsize=(15, 5))
sns.countplot(data=df, x="brand_name")
plt.xticks(rotation=90)
plt.figure(figsize=(15, 5))
sns.countplot(data=df, x="os")
plt.xticks(rotation=90)
histogram_boxplot(df, "screen_size", kde=True)
df.groupby(["4g", "5g"]).size()
4g 5g
no no 1119
yes no 2183
yes 152
dtype: int64
histogram_boxplot(df, "main_camera_mp", kde=True)
histogram_boxplot(df, "selfie_camera_mp", kde=True)
histogram_boxplot(df, "int_memory", kde=True)
histogram_boxplot(df, "ram", kde=True)
histogram_boxplot(df, "battery", kde=True)
histogram_boxplot(df, "weight", kde=True)
plt.figure(figsize=(15, 5))
sns.countplot(data=df, x="release_year")
plt.xticks(rotation=90)
histogram_boxplot(df, "days_used", kde=True)
histogram_boxplot(df, "normalized_new_price", kde=True)
histogram_boxplot(df, "normalized_used_price", kde=True)
sns.displot(df["normalized_used_price"], color="lightsteelblue", kde=True)
plt.axvline(x=df.normalized_used_price.mean(), color="red")
plt.axvline(x=df.normalized_used_price.median(), color="black")
sns.boxplot(x=df["normalized_used_price"], showmeans=True, color="lightsteelblue")
labeled_barplot(df, "os", perc=True, n=None)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=df, y="ram", x="brand_name")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=df, y="ram", x="brand_name")
plt.xticks(rotation=90)
plt.show()
large_battery = df[df["battery"] >= 4500]
sns.scatterplot(x=large_battery["battery"], y=large_battery["weight"])
plt.figure(figsize=(30, 10))
sns.barplot(data=large_battery, y="weight", x="battery")
plt.xticks(rotation=90)
# screen_size appears to be measured in cm, so this threshold keeps nearly all devices
large_screen = df[df["screen_size"] >= 6]
large_screen["brand_name"].value_counts()
Others        479
Samsung       334
Huawei        251
LG            197
Lenovo        171
ZTE           140
Xiaomi        132
Oppo          129
Asus          122
Vivo          117
Honor         116
Alcatel       115
HTC           110
Micromax      108
Motorola      106
Sony           86
Nokia          72
Meizu          62
Gionee         56
Acer           51
XOLO           49
Panasonic      47
Realme         41
Apple          39
Lava           36
Spice          30
Karbonn        29
Celkon         25
Coolpad        22
OnePlus        22
Microsoft      22
BlackBerry     21
Google         15
Infinix        10
Name: brand_name, dtype: int64
sns.countplot(data=large_screen, x=large_screen["brand_name"])
plt.xticks(rotation=90)
screen_sum = large_screen["brand_name"].value_counts().sum()
perc_lrg_scrn = (screen_sum / 3454) * 100
perc_lrg_scrn = perc_lrg_scrn.round(2)
print(
    f"There are {screen_sum} phones out of 3454 total with a screen size of at least 6 cm."
)
print(f"This is ~{perc_lrg_scrn}% of the total number of phones.")
There are 3362 phones out of 3454 total with a screen size of at least 6 cm.
This is ~97.34% of the total number of phones.
best_selfie_cams = df[df["selfie_camera_mp"] >= 8]
sns.displot(best_selfie_cams["selfie_camera_mp"])
sns.pairplot(
df[
[
"brand_name",
"os",
"screen_size",
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
"days_used",
"normalized_new_price",
"normalized_used_price",
]
]
)
plt.figure(figsize=(10, 5))
sns.heatmap(df.corr(), annot=True)
# checking missing values again in the data
df.isnull().sum().sort_values(ascending=False)
main_camera_mp           179
weight                     7
battery                    6
int_memory                 4
ram                        4
selfie_camera_mp           2
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
# Let's look at the overall data description again.
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3454.0 | 13.713115 | 3.805280 | 5.080000 | 12.700000 | 12.830000 | 15.340000 | 30.710000 |
| main_camera_mp | 3275.0 | 9.460208 | 4.815461 | 0.080000 | 5.000000 | 8.000000 | 13.000000 | 48.000000 |
| selfie_camera_mp | 3452.0 | 6.554229 | 6.970372 | 0.000000 | 2.000000 | 5.000000 | 8.000000 | 32.000000 |
| int_memory | 3450.0 | 54.573099 | 84.972371 | 0.010000 | 16.000000 | 32.000000 | 64.000000 | 1024.000000 |
| ram | 3450.0 | 4.036122 | 1.365105 | 0.020000 | 4.000000 | 4.000000 | 4.000000 | 12.000000 |
| battery | 3448.0 | 3133.402697 | 1299.682844 | 500.000000 | 2100.000000 | 3000.000000 | 4000.000000 | 9720.000000 |
| weight | 3447.0 | 182.751871 | 88.413228 | 69.000000 | 142.000000 | 160.000000 | 185.000000 | 855.000000 |
| release_year | 3454.0 | 2015.965258 | 2.298455 | 2013.000000 | 2014.000000 | 2015.500000 | 2018.000000 | 2020.000000 |
| days_used | 3454.0 | 674.869716 | 248.580166 | 91.000000 | 533.500000 | 690.500000 | 868.750000 | 1094.000000 |
| normalized_used_price | 3454.0 | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.755700 | 6.619433 |
| normalized_new_price | 3454.0 | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
# Let's consider the medians for imputation of the missing values.
df.median(numeric_only=True)
screen_size                12.830000
main_camera_mp              8.000000
selfie_camera_mp            5.000000
int_memory                 32.000000
ram                         4.000000
battery                  3000.000000
weight                    160.000000
release_year             2015.500000
days_used                 690.500000
normalized_used_price       4.405133
normalized_new_price        5.245892
dtype: float64
# Fill NaNs with the median of each numeric column and check our new data.
df = df.fillna(df.median(numeric_only=True))
df.sample(10)
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1825 | Meizu | Android | 12.88 | yes | no | 21.00 | 5.0 | 32.0 | 4.0 | 3050.0 | 168.0 | 2015 | 752 | 4.846547 | 5.963297 |
| 1623 | Lenovo | Android | 17.78 | no | no | 5.00 | 0.3 | 16.0 | 4.0 | 3500.0 | 339.0 | 2013 | 899 | 4.632493 | 5.185149 |
| 1704 | LG | Android | 12.70 | yes | no | 8.00 | 2.1 | 16.0 | 4.0 | 2610.0 | 135.0 | 2015 | 980 | 4.161847 | 5.396532 |
| 2435 | Samsung | Android | 12.88 | yes | no | 12.00 | 5.0 | 64.0 | 4.0 | 3500.0 | 169.0 | 2016 | 847 | 4.994099 | 6.745413 |
| 1263 | Huawei | Android | 20.32 | yes | no | 13.00 | 8.0 | 64.0 | 4.0 | 5100.0 | 310.0 | 2019 | 516 | 4.747884 | 5.083824 |
| 3187 | | Android | 15.32 | yes | no | 12.20 | 8.0 | 64.0 | 6.0 | 3700.0 | 193.0 | 2019 | 487 | 4.870146 | 6.358102 |
| 3217 | Huawei | Android | 14.50 | yes | no | 13.00 | 5.0 | 16.0 | 2.0 | 3020.0 | 146.0 | 2019 | 382 | 3.997099 | 4.293742 |
| 447 | Acer | Android | 20.32 | no | no | 5.00 | 2.0 | 16.0 | 4.0 | 4600.0 | 360.0 | 2014 | 611 | 4.663345 | 5.295764 |
| 3029 | XOLO | Android | 10.24 | yes | no | 8.00 | 1.0 | 32.0 | 4.0 | 1810.0 | 140.0 | 2013 | 565 | 4.224349 | 5.242170 |
| 1627 | Lenovo | Android | 17.78 | no | no | 3.15 | 0.3 | 16.0 | 4.0 | 3550.0 | 400.0 | 2013 | 890 | 4.430460 | 5.001998 |
# checking missing values again in the data
df.isnull().sum().sort_values(ascending=False)
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64
# Let's drop the brand name because it may be redundant information given that we have the operating system.
df = df.drop(["brand_name"], axis=1)
# Let's examine the main camera MP now that we've imputed NaN values with the median.
sns.boxplot(data=df, x="main_camera_mp")
# Consider the heat map again to view any changes in correlations.
plt.figure(figsize=(10, 5))
sns.heatmap(df.corr(), annot=True)
sns.pairplot(
df[
[
"os",
"screen_size",
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
"days_used",
"normalized_new_price",
"normalized_used_price",
]
]
)
# Let's change the yes and no in '4g' and '5g' to 1s and 0s.
df["4g"] = df["4g"].replace({"yes": 1, "no": 0})
df["5g"] = df["5g"].replace({"yes": 1, "no": 0})
df.groupby(["4g", "5g"]).size()
4g 5g
0 0 1119
1 0 2183
1 152
dtype: int64
# independent variables
X = df.drop(["normalized_used_price"], axis=1)
# dependent variable
y = df[["normalized_used_price"]]
# creating dummy variables
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
X.head()
| | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_new_price | os_Others | os_Windows | os_iOS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.50 | 1 | 0 | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.715100 | 0 | 0 | 0 |
| 1 | 17.30 | 1 | 1 | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.519018 | 0 | 0 | 0 |
| 2 | 16.69 | 1 | 1 | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.884631 | 0 | 0 | 0 |
| 3 | 25.50 | 1 | 1 | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.630961 | 0 | 0 | 0 |
| 4 | 15.32 | 1 | 0 | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.947837 | 0 | 0 | 0 |
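As a quick illustration of how `drop_first=True` behaves, here is a toy example (the `toy` frame below is made up for demonstration and is not part of the dataset): the first category in sorted order becomes the baseline and gets no dummy column, which is why the encoded data above has no `os_Android` column.

```python
import pandas as pd

# hypothetical toy frame, only to illustrate drop_first behavior
toy = pd.DataFrame({"os": ["Android", "iOS", "Windows", "Android"]})
dummies = pd.get_dummies(toy, columns=["os"], drop_first=True)
# "Android", the first category in sorted order, is dropped and serves as the baseline
print(list(dummies.columns))  # ['os_Windows', 'os_iOS']
```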
# let's add the intercept to data
X = sm.add_constant(X)
Split X and y into train and test sets in a 70:30 ratio.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 2417
Number of rows in test data = 1037
print(X_train.head())
const screen_size 4g 5g main_camera_mp selfie_camera_mp \
3026 1.0 10.29 0 0 8.0 0.3
1525 1.0 15.34 1 0 13.0 5.0
1128 1.0 12.70 0 0 13.0 5.0
3003 1.0 12.83 1 0 8.0 5.0
2907 1.0 12.88 1 0 13.0 16.0
int_memory ram battery weight release_year days_used \
3026 16.0 4.0 1800.0 120.0 2014 819
1525 32.0 4.0 4050.0 225.0 2016 585
1128 32.0 4.0 2550.0 162.0 2015 727
3003 16.0 4.0 3200.0 160.0 2015 800
2907 16.0 4.0 2900.0 160.0 2017 560
normalized_new_price os_Others os_Windows os_iOS
3026 4.796204 0 0 0
1525 5.434595 0 0 0
1128 5.137914 0 0 0
3003 5.189228 0 0 0
2907 5.016220 0 0 0
print(X_test.head())
const screen_size 4g 5g main_camera_mp selfie_camera_mp \
866 1.0 15.24 0 0 8.00 2.0
957 1.0 10.16 0 0 3.15 0.3
280 1.0 15.39 1 0 8.00 8.0
2150 1.0 12.83 1 0 13.00 16.0
93 1.0 15.29 1 0 13.00 5.0
int_memory ram battery weight release_year days_used \
866 16.0 4.00 3000.0 206.0 2014 632
957 512.0 0.25 1400.0 140.0 2013 637
280 32.0 2.00 5000.0 185.0 2020 329
2150 64.0 4.00 3200.0 148.0 2017 648
93 32.0 3.00 3500.0 179.0 2019 216
normalized_new_price os_Others os_Windows os_iOS
866 5.190788 0 0 0
957 3.884652 0 0 0
280 4.487287 0 0 0
2150 5.639422 0 0 0
93 5.300415 0 0 0
olsmodel = sm.OLS(y_train, X_train).fit()
print(olsmodel.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.841
Model: OLS Adj. R-squared: 0.840
Method: Least Squares F-statistic: 847.8
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 95.399
No. Observations: 2417 AIC: -158.8
Df Residuals: 2401 BIC: -66.15
Df Model: 15
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -53.1615 8.970 -5.927 0.000 -70.751 -35.572
screen_size 0.0251 0.003 7.551 0.000 0.019 0.032
4g 0.0473 0.015 3.092 0.002 0.017 0.077
5g -0.0156 0.031 -0.500 0.617 -0.077 0.046
main_camera_mp 0.0204 0.001 14.513 0.000 0.018 0.023
selfie_camera_mp 0.0140 0.001 12.913 0.000 0.012 0.016
int_memory 9.026e-05 6.7e-05 1.347 0.178 -4.12e-05 0.000
ram 0.0214 0.005 4.289 0.000 0.012 0.031
battery -1.074e-05 7.09e-06 -1.515 0.130 -2.46e-05 3.16e-06
weight 0.0009 0.000 7.021 0.000 0.001 0.001
release_year 0.0270 0.004 6.075 0.000 0.018 0.036
days_used 3.729e-05 3.06e-05 1.219 0.223 -2.27e-05 9.73e-05
normalized_new_price 0.4202 0.011 36.935 0.000 0.398 0.443
os_Others -0.0537 0.029 -1.826 0.068 -0.111 0.004
os_Windows 0.0271 0.036 0.745 0.456 -0.044 0.098
os_iOS -0.0635 0.045 -1.420 0.156 -0.151 0.024
==============================================================================
Omnibus: 219.770 Durbin-Watson: 1.904
Prob(Omnibus): 0.000 Jarque-Bera (JB): 413.839
Skew: -0.612 Prob(JB): 1.37e-90
Kurtosis: 4.616 Cond. No. 7.47e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.47e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_series1 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))
VIF values: 

const                   3.570595e+06
screen_size             7.257522e+00
4g                      2.269786e+00
5g                      1.762858e+00
main_camera_mp          1.924868e+00
selfie_camera_mp        2.572363e+00
int_memory              1.247150e+00
ram                     2.103621e+00
battery                 3.835618e+00
weight                  6.120544e+00
release_year            4.613026e+00
days_used               2.589619e+00
normalized_new_price    2.658599e+00
os_Others               1.475205e+00
os_Windows              1.023196e+00
os_iOS                  1.089057e+00
dtype: float64
X_train2 = X_train.drop(["screen_size"], axis=1)
olsmod_1 = sm.OLS(y_train, X_train2)
olsres_1 = olsmod_1.fit()
print(
"R-squared:",
np.round(olsres_1.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_1.rsquared_adj, 3),
)
R-squared: 0.837
Adjusted R-squared: 0.836
X_train3 = X_train.drop(["weight"], axis=1)
olsmod_2 = sm.OLS(y_train, X_train3)
olsres_2 = olsmod_2.fit()
print(
"R-squared:",
np.round(olsres_2.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_2.rsquared_adj, 3),
)
R-squared: 0.838
Adjusted R-squared: 0.837
X_train4 = X_train.drop(["release_year"], axis=1)
olsmod_3 = sm.OLS(y_train, X_train4)
olsres_3 = olsmod_3.fit()
print(
"R-squared:",
np.round(olsres_3.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_3.rsquared_adj, 3),
)
R-squared: 0.839
Adjusted R-squared: 0.838
X_train5 = X_train.drop(["battery"], axis=1)
olsmod_4 = sm.OLS(y_train, X_train5)
olsres_4 = olsmod_4.fit()
print(
"R-squared:",
np.round(olsres_4.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_4.rsquared_adj, 3),
)
R-squared: 0.841
Adjusted R-squared: 0.84
X_train = X_train.drop(["battery"], axis=1)
olsmod_5 = sm.OLS(y_train, X_train)
olsres_5 = olsmod_5.fit()
print(olsres_5.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.841
Model: OLS Adj. R-squared: 0.840
Method: Least Squares F-statistic: 907.8
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 94.243
No. Observations: 2417 AIC: -158.5
Df Residuals: 2402 BIC: -71.63
Df Model: 14
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -50.5899 8.810 -5.742 0.000 -67.867 -33.313
screen_size 0.0239 0.003 7.405 0.000 0.018 0.030
4g 0.0444 0.015 2.929 0.003 0.015 0.074
5g -0.0149 0.031 -0.478 0.633 -0.076 0.046
main_camera_mp 0.0202 0.001 14.432 0.000 0.017 0.023
selfie_camera_mp 0.0139 0.001 12.869 0.000 0.012 0.016
int_memory 9.094e-05 6.7e-05 1.356 0.175 -4.05e-05 0.000
ram 0.0212 0.005 4.258 0.000 0.011 0.031
weight 0.0009 0.000 6.889 0.000 0.001 0.001
release_year 0.0257 0.004 5.893 0.000 0.017 0.034
days_used 3.831e-05 3.06e-05 1.252 0.211 -2.17e-05 9.83e-05
normalized_new_price 0.4190 0.011 36.914 0.000 0.397 0.441
os_Others -0.0560 0.029 -1.907 0.057 -0.114 0.002
os_Windows 0.0306 0.036 0.844 0.399 -0.041 0.102
os_iOS -0.0607 0.045 -1.357 0.175 -0.148 0.027
==============================================================================
Omnibus: 219.813 Durbin-Watson: 1.905
Prob(Omnibus): 0.000 Jarque-Bera (JB): 413.866
Skew: -0.612 Prob(JB): 1.35e-90
Kurtosis: 4.616 Cond. No. 3.96e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.96e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
vif_series2 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series2))
VIF values: 

const                   3.442813e+06
screen_size             6.809089e+00
4g                      2.236092e+00
5g                      1.762478e+00
main_camera_mp          1.909981e+00
selfie_camera_mp        2.569883e+00
int_memory              1.247095e+00
ram                     2.102766e+00
weight                  5.527262e+00
release_year            4.448187e+00
days_used               2.588378e+00
normalized_new_price    2.644157e+00
os_Others               1.471227e+00
os_Windows              1.019005e+00
os_iOS                  1.087090e+00
dtype: float64
X_train6 = X_train.drop(["release_year"], axis=1)
olsmod_6 = sm.OLS(y_train, X_train6)
olsres_6 = olsmod_6.fit()
print(
"R-squared:",
np.round(olsres_6.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_6.rsquared_adj, 3),
)
R-squared: 0.839
Adjusted R-squared: 0.838
X_train7 = X_train.drop(["screen_size"], axis=1)
olsmod_7 = sm.OLS(y_train, X_train7)
olsres_7 = olsmod_7.fit()
print(
"R-squared:",
np.round(olsres_7.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_7.rsquared_adj, 3),
)
R-squared: 0.837
Adjusted R-squared: 0.837
X_train8 = X_train.drop(["weight"], axis=1)
olsmod_8 = sm.OLS(y_train, X_train8)
olsres_8 = olsmod_8.fit()
print(
"R-squared:",
np.round(olsres_8.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_8.rsquared_adj, 3),
)
R-squared: 0.838
Adjusted R-squared: 0.837
Since dropping the 'release_year' column reduces the adjusted R-squared by only 0.002, we can remove it from the training set.
X_train = X_train.drop(["release_year"], axis=1)
olsmod_9 = sm.OLS(y_train, X_train)
olsres_9 = olsmod_9.fit()
print(olsres_9.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 961.4
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 76.893
No. Observations: 2417 AIC: -125.8
Df Residuals: 2403 BIC: -44.72
Df Model: 13
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.3334 0.053 25.358 0.000 1.230 1.437
screen_size 0.0285 0.003 9.038 0.000 0.022 0.035
4g 0.0838 0.014 6.103 0.000 0.057 0.111
5g 0.0071 0.031 0.227 0.820 -0.054 0.068
main_camera_mp 0.0207 0.001 14.714 0.000 0.018 0.023
selfie_camera_mp 0.0161 0.001 15.808 0.000 0.014 0.018
int_memory 0.0001 6.73e-05 1.782 0.075 -1.21e-05 0.000
ram 0.0199 0.005 3.975 0.000 0.010 0.030
weight 0.0008 0.000 6.068 0.000 0.001 0.001
days_used -6.316e-05 2.55e-05 -2.480 0.013 -0.000 -1.32e-05
normalized_new_price 0.4024 0.011 36.341 0.000 0.381 0.424
os_Others -0.0465 0.030 -1.577 0.115 -0.104 0.011
os_Windows 0.0264 0.037 0.724 0.469 -0.045 0.098
os_iOS -0.0452 0.045 -1.005 0.315 -0.133 0.043
==============================================================================
Omnibus: 207.390 Durbin-Watson: 1.908
Prob(Omnibus): 0.000 Jarque-Bera (JB): 373.773
Skew: -0.597 Prob(JB): 6.86e-82
Kurtosis: 4.512 Cond. No. 8.72e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.72e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
vif_series3 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series3))
VIF values: 

const                   120.930292
screen_size               6.410666
4g                        1.803877
5g                        1.737245
main_camera_mp            1.902949
selfie_camera_mp          2.258109
int_memory                1.240372
ram                       2.098719
weight                    5.414534
days_used                 1.768842
normalized_new_price      2.481703
os_Others                 1.466842
os_Windows                1.018617
os_iOS                    1.083330
dtype: float64
Multicollinearity is still present in the data ('screen_size' and 'weight' have the highest VIF values), so we should drop the 'screen_size' column as well.
X_train = X_train.drop(["screen_size"], axis=1)
vif_series4 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values after dropping 'screen_size': \n\n{}\n".format(vif_series4))
VIF values after dropping 'screen_size': 

const                   102.494809
4g                        1.769824
5g                        1.736735
main_camera_mp            1.887600
selfie_camera_mp          2.206567
int_memory                1.240313
ram                       2.098450
weight                    1.244897
days_used                 1.641063
normalized_new_price      2.468549
os_Others                 1.249403
os_Windows                1.017561
os_iOS                    1.080976
dtype: float64
olsmod_10 = sm.OLS(y_train, X_train)
olsres_10 = olsmod_10.fit()
print(olsres_10.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.833
Model: OLS Adj. R-squared: 0.832
Method: Least Squares F-statistic: 1001.
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 36.498
No. Observations: 2417 AIC: -47.00
Df Residuals: 2404 BIC: 28.28
Df Model: 12
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5189 0.049 30.864 0.000 1.422 1.615
4g 0.1008 0.014 7.293 0.000 0.074 0.128
5g 0.0023 0.032 0.071 0.943 -0.060 0.064
main_camera_mp 0.0218 0.001 15.333 0.000 0.019 0.025
selfie_camera_mp 0.0175 0.001 17.089 0.000 0.016 0.020
int_memory 0.0001 6.84e-05 1.691 0.091 -1.85e-05 0.000
ram 0.0205 0.005 4.011 0.000 0.010 0.030
weight 0.0017 6.06e-05 28.717 0.000 0.002 0.002
days_used -0.0001 2.49e-05 -5.013 0.000 -0.000 -7.61e-05
normalized_new_price 0.4097 0.011 36.490 0.000 0.388 0.432
os_Others -0.1492 0.028 -5.389 0.000 -0.203 -0.095
os_Windows 0.0158 0.037 0.426 0.670 -0.057 0.089
os_iOS -0.0641 0.046 -1.404 0.160 -0.154 0.025
==============================================================================
Omnibus: 217.149 Durbin-Watson: 1.906
Prob(Omnibus): 0.000 Jarque-Bera (JB): 381.854
Skew: -0.628 Prob(JB): 1.21e-83
Kurtosis: 4.488 Cond. No. 8.17e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
X_train11 = X_train.drop(["normalized_new_price"], axis=1)
olsmod_11 = sm.OLS(y_train, X_train11)
olsres_11 = olsmod_11.fit()
print(
"R-squared:",
np.round(olsres_11.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_11.rsquared_adj, 3),
)
R-squared: 0.741
Adjusted R-squared: 0.74
X_train12 = X_train.drop(["selfie_camera_mp"], axis=1)
olsmod_12 = sm.OLS(y_train, X_train12)
olsres_12 = olsmod_12.fit()
print(
"R-squared:",
np.round(olsres_12.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_12.rsquared_adj, 3),
)
R-squared: 0.813
Adjusted R-squared: 0.812
X_train13 = X_train.drop(["ram"], axis=1)
olsmod_13 = sm.OLS(y_train, X_train13)
olsres_13 = olsmod_13.fit()
print(
"R-squared:",
np.round(olsres_13.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_13.rsquared_adj, 3),
)
R-squared: 0.832
Adjusted R-squared: 0.831
X_train = X_train.drop(["ram"], axis=1)
vif_series5 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values after dropping 'ram': \n\n{}\n".format(vif_series5))
VIF values after dropping 'ram': 

const                   102.076822
4g                        1.760814
5g                        1.386330
main_camera_mp            1.883288
selfie_camera_mp          2.098606
int_memory                1.223760
weight                    1.242487
days_used                 1.640664
normalized_new_price      2.247655
os_Others                 1.173925
os_Windows                1.016632
os_iOS                    1.077330
dtype: float64
olsmod_13 = sm.OLS(y_train, X_train)
olsres_13 = olsmod_13.fit()
print(olsres_13.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.832
Model: OLS Adj. R-squared: 0.831
Method: Least Squares F-statistic: 1084.
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 28.437
No. Observations: 2417 AIC: -32.87
Df Residuals: 2405 BIC: 36.61
Df Model: 11
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5315 0.049 31.086 0.000 1.435 1.628
4g 0.0969 0.014 7.003 0.000 0.070 0.124
5g 0.0594 0.028 2.090 0.037 0.004 0.115
main_camera_mp 0.0215 0.001 15.111 0.000 0.019 0.024
selfie_camera_mp 0.0184 0.001 18.375 0.000 0.016 0.020
int_memory 8.403e-05 6.82e-05 1.232 0.218 -4.97e-05 0.000
weight 0.0017 6.07e-05 28.479 0.000 0.002 0.002
days_used -0.0001 2.5e-05 -4.936 0.000 -0.000 -7.44e-05
normalized_new_price 0.4232 0.011 39.375 0.000 0.402 0.444
os_Others -0.1765 0.027 -6.556 0.000 -0.229 -0.124
os_Windows 0.0203 0.037 0.546 0.585 -0.053 0.093
os_iOS -0.0747 0.046 -1.635 0.102 -0.164 0.015
==============================================================================
Omnibus: 240.732 Durbin-Watson: 1.907
Prob(Omnibus): 0.000 Jarque-Bera (JB): 421.733
Skew: -0.684 Prob(JB): 2.64e-92
Kurtosis: 4.522 Cond. No. 8.15e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.15e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
X_train = X_train.drop(["os_Windows"], axis=1)
olsmod_14 = sm.OLS(y_train, X_train)
olsres_14 = olsmod_14.fit()
print(olsres_14.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.832
Model: OLS Adj. R-squared: 0.831
Method: Least Squares F-statistic: 1193.
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 28.287
No. Observations: 2417 AIC: -34.57
Df Residuals: 2406 BIC: 29.12
Df Model: 10
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5318 0.049 31.099 0.000 1.435 1.628
4g 0.0965 0.014 6.985 0.000 0.069 0.124
5g 0.0595 0.028 2.093 0.036 0.004 0.115
main_camera_mp 0.0215 0.001 15.106 0.000 0.019 0.024
selfie_camera_mp 0.0184 0.001 18.370 0.000 0.016 0.020
int_memory 8.339e-05 6.82e-05 1.223 0.221 -5.03e-05 0.000
weight 0.0017 6.07e-05 28.483 0.000 0.002 0.002
days_used -0.0001 2.5e-05 -4.922 0.000 -0.000 -7.4e-05
normalized_new_price 0.4233 0.011 39.398 0.000 0.402 0.444
os_Others -0.1772 0.027 -6.589 0.000 -0.230 -0.124
os_iOS -0.0750 0.046 -1.640 0.101 -0.165 0.015
==============================================================================
Omnibus: 241.327 Durbin-Watson: 1.908
Prob(Omnibus): 0.000 Jarque-Bera (JB): 423.294
Skew: -0.685 Prob(JB): 1.21e-92
Kurtosis: 4.525 Cond. No. 8.15e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.15e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
X_train = X_train.drop(["int_memory"], axis=1)
olsmod_15 = sm.OLS(y_train, X_train)
olsres_15 = olsmod_15.fit()
print(olsres_15.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.832
Model: OLS Adj. R-squared: 0.831
Method: Least Squares F-statistic: 1325.
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 27.536
No. Observations: 2417 AIC: -35.07
Df Residuals: 2407 BIC: 22.83
Df Model: 9
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5287 0.049 31.074 0.000 1.432 1.625
4g 0.0950 0.014 6.905 0.000 0.068 0.122
5g 0.0629 0.028 2.222 0.026 0.007 0.118
main_camera_mp 0.0214 0.001 15.055 0.000 0.019 0.024
selfie_camera_mp 0.0186 0.001 18.902 0.000 0.017 0.021
weight 0.0017 6.05e-05 28.461 0.000 0.002 0.002
days_used -0.0001 2.49e-05 -5.065 0.000 -0.000 -7.73e-05
normalized_new_price 0.4254 0.011 40.128 0.000 0.405 0.446
os_Others -0.1750 0.027 -6.522 0.000 -0.228 -0.122
os_iOS -0.0739 0.046 -1.616 0.106 -0.163 0.016
==============================================================================
Omnibus: 234.844 Durbin-Watson: 1.908
Prob(Omnibus): 0.000 Jarque-Bera (JB): 417.952
Skew: -0.665 Prob(JB): 1.75e-91
Kurtosis: 4.543 Cond. No. 8.13e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.13e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
X_train = X_train.drop(["os_iOS"], axis=1)
olsmod_16 = sm.OLS(y_train, X_train)
olsres_16 = olsmod_16.fit()
print(olsres_16.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.832
Model: OLS Adj. R-squared: 0.831
Method: Least Squares F-statistic: 1489.
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:49 Log-Likelihood: 26.225
No. Observations: 2417 AIC: -34.45
Df Residuals: 2408 BIC: 17.66
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5438 0.048 31.955 0.000 1.449 1.639
4g 0.0936 0.014 6.813 0.000 0.067 0.121
5g 0.0662 0.028 2.346 0.019 0.011 0.122
main_camera_mp 0.0216 0.001 15.229 0.000 0.019 0.024
selfie_camera_mp 0.0188 0.001 19.142 0.000 0.017 0.021
weight 0.0017 6.04e-05 28.407 0.000 0.002 0.002
days_used -0.0001 2.49e-05 -5.032 0.000 -0.000 -7.64e-05
normalized_new_price 0.4221 0.010 40.562 0.000 0.402 0.443
os_Others -0.1767 0.027 -6.590 0.000 -0.229 -0.124
==============================================================================
Omnibus: 235.934 Durbin-Watson: 1.911
Prob(Omnibus): 0.000 Jarque-Bera (JB): 418.029
Skew: -0.669 Prob(JB): 1.68e-91
Kurtosis: 4.536 Cond. No. 7.62e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
vif_series6 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values for the final model: \n\n{}\n".format(vif_series6))
VIF values for the final model: 

const                   98.095716
4g                       1.735824
5g                       1.365891
main_camera_mp           1.856913
selfie_camera_mp         2.009243
weight                   1.229998
days_used                1.622880
normalized_new_price     2.106338
os_Others                1.164506
dtype: float64
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train.values.flatten() # actual values
df_pred["Fitted Values"] = olsres_16.fittedvalues.values  # predicted values from the final model
df_pred["Residuals"] = olsres_16.resid.values  # residuals of the final model
df_pred.head()
|   | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 0 | 4.087488 | 3.853581 | 0.233907 |
| 1 | 4.448399 | 4.621082 | -0.172682 |
| 2 | 4.315353 | 4.271353 | 0.044000 |
| 3 | 4.282068 | 4.269711 | 0.012357 |
| 4 | 4.456438 | 4.530547 | -0.074109 |
# let us plot the fitted values vs residuals
sns.set_style("whitegrid")
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
Observations
# checking the pairwise distributions and relationships among some of the predictors
sns.pairplot(
df[
[
"normalized_new_price",
"selfie_camera_mp",
"main_camera_mp",
"4g",
"5g",
"days_used",
]
]
)
plt.show()
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
Observations
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
Observations
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9706935882568359, pvalue=7.96311690159937e-22)
Observations
The null and alternate hypotheses of the Goldfeld-Quandt test are as follows:

Null hypothesis: The residuals are homoscedastic (have equal variance).
Alternate hypothesis: The residuals are heteroscedastic.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], X_train)
lzip(name, test)
[('F statistic', 1.0292957119364605), ('p-value', 0.3089338617486743)]
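Since the p-value (0.309) is above 0.05, we fail to reject the null hypothesis of homoscedasticity. For reference, both residual checks used above (Shapiro-Wilk for normality, Goldfeld-Quandt for equal variance) can be run side by side; the sketch below uses synthetic residuals and a made-up design matrix, with the usual 0.05 significance level.

```python
# Sketch: residual diagnostics on synthetic data. 'resid' stands in for
# the model residuals and 'exog' for the design matrix; both are invented.
import numpy as np
import scipy.stats as stats
import statsmodels.stats.api as sms

rng = np.random.default_rng(42)
resid = rng.normal(size=500)      # well-behaved (normal, homoscedastic) residuals
exog = rng.normal(size=(500, 2))  # stand-in design matrix

shapiro_p = stats.shapiro(resid).pvalue              # normality test
gq_stat, gq_p, _ = sms.het_goldfeldquandt(resid, exog)  # homoscedasticity test

print("normality rejected at 5%:", shapiro_p < 0.05)
print("heteroscedasticity detected at 5%:", gq_p < 0.05)
```

Note that with large samples, Shapiro-Wilk can reject normality for tiny departures (as it does for this model), so the Q-Q plot and histogram remain important complements to the p-value.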
print(olsres_16.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.832
Model: OLS Adj. R-squared: 0.831
Method: Least Squares F-statistic: 1489.
Date: Thu, 24 Mar 2022 Prob (F-statistic): 0.00
Time: 21:55:53 Log-Likelihood: 26.225
No. Observations: 2417 AIC: -34.45
Df Residuals: 2408 BIC: 17.66
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5438 0.048 31.955 0.000 1.449 1.639
4g 0.0936 0.014 6.813 0.000 0.067 0.121
5g 0.0662 0.028 2.346 0.019 0.011 0.122
main_camera_mp 0.0216 0.001 15.229 0.000 0.019 0.024
selfie_camera_mp 0.0188 0.001 19.142 0.000 0.017 0.021
weight 0.0017 6.04e-05 28.407 0.000 0.002 0.002
days_used -0.0001 2.49e-05 -5.032 0.000 -0.000 -7.64e-05
normalized_new_price 0.4221 0.010 40.562 0.000 0.402 0.443
os_Others -0.1767 0.027 -6.590 0.000 -0.229 -0.124
==============================================================================
Omnibus: 235.934 Durbin-Watson: 1.911
Prob(Omnibus): 0.000 Jarque-Bera (JB): 418.029
Skew: -0.669 Prob(JB): 1.68e-91
Kurtosis: 4.536 Cond. No. 7.62e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
The R-squared of the model is 0.832 and the adjusted R-squared is 0.831, which shows that the model is able to explain ~83% of the variance in the data. This is quite good.
A unit increase in the normalized new price will result in a 0.4221 unit increase in the phone's normalized used price, all other variables remaining constant.
If a phone supports 4G, its normalized used price will be 0.0936 units higher, all other variables remaining constant.
If a phone supports 5G, its normalized used price will be 0.0662 units higher, all other variables remaining constant.
If a phone runs an operating system other than Windows, iOS, or Android, its normalized used price will be 0.1767 units lower, all other variables remaining constant.
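As a worked check of these interpretations, we can score one hypothetical phone by hand with the (rounded) coefficients from the summary above. The phone's feature values below are invented purely for illustration.

```python
# Worked example: predicted normalized used price for one hypothetical phone,
# using rounded coefficients from the olsres_16 summary.
coefs = {
    "const": 1.5438, "4g": 0.0936, "5g": 0.0662, "main_camera_mp": 0.0216,
    "selfie_camera_mp": 0.0188, "weight": 0.0017, "days_used": -0.000125,
    "normalized_new_price": 0.4221, "os_Others": -0.1767,
}
# a made-up 4G Android phone: 13 MP main camera, 8 MP selfie camera,
# 160 g, 400 days of use, normalized new price of 5.2
phone = {
    "const": 1, "4g": 1, "5g": 0, "main_camera_mp": 13, "selfie_camera_mp": 8,
    "weight": 160, "days_used": 400, "normalized_new_price": 5.2, "os_Others": 0,
}
pred = sum(coefs[k] * phone[k] for k in coefs)
print(round(pred, 4))  # predicted normalized used price
```

Each coefficient contributes its value times the feature, so, for example, 400 days of use lowers the prediction by 0.000125 * 400 = 0.05 units.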
# let's check the model parameters
olsres_16.params
const                   1.543787
4g                      0.093592
5g                      0.066201
main_camera_mp          0.021553
selfie_camera_mp        0.018787
weight                  0.001717
days_used              -0.000125
normalized_new_price    0.422106
os_Others              -0.176735
dtype: float64
# Let us write the equation of linear regression
Equation = "normalized_used_price ="
print(Equation, end=" ")
for i in range(len(X_train.columns)):
    if i == 0:
        print(olsres_16.params.iloc[i], "+", end=" ")
    elif i != len(X_train.columns) - 1:
        print(
            olsres_16.params.iloc[i], "* (", X_train.columns[i], ")", "+", end=" ",
        )
    else:
        print(olsres_16.params.iloc[i], "* (", X_train.columns[i], ")")
normalized_used_price = 1.5437870014114647 + 0.09359239953688234 * ( 4g ) + 0.06620102776280007 * ( 5g ) + 0.021553405922728825 * ( main_camera_mp ) + 0.01878735956524039 * ( selfie_camera_mp ) + 0.0017165307638577984 * ( weight ) + -0.00012523331272874675 * ( days_used ) + 0.4221061701066799 * ( normalized_new_price ) + -0.1767351800095593 * ( os_Others )
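A more compact way to build the same equation is to iterate over the named coefficients directly. The sketch below hard-codes an abbreviated coefficient Series so it runs on its own; in the notebook, `olsres_16.params` would be used instead.

```python
# Sketch: build the regression equation string from a named coefficient
# Series. The three coefficients here are an abbreviated, hard-coded
# subset of the fitted model, for illustration only.
import pandas as pd

params = pd.Series(
    {"const": 1.5438, "4g": 0.0936, "normalized_new_price": 0.4221}
)

terms = [f"{params['const']:.4f}"] + [
    f"{coef:.4f} * ({name})" for name, coef in params.drop("const").items()
]
equation = "normalized_used_price = " + " + ".join(terms)
print(equation)
# → normalized_used_price = 1.5438 + 0.0936 * (4g) + 0.4221 * (normalized_new_price)
```

Using `params.items()` avoids positional indexing into the Series and keeps each coefficient paired with its column name.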
X_train.columns
Index(['const', '4g', '5g', 'main_camera_mp', 'selfie_camera_mp', 'weight',
'days_used', 'normalized_new_price', 'os_Others'],
dtype='object')
X_test.columns
Index(['const', 'screen_size', '4g', '5g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'normalized_new_price', 'os_Others',
'os_Windows', 'os_iOS'],
dtype='object')
# dropping columns from the test data that are not there in the training data
X_test2 = X_test.drop(
[
"screen_size",
"int_memory",
"ram",
"battery",
"release_year",
"os_Windows",
"os_iOS",
],
axis=1,
)
X_test2.columns
Index(['const', '4g', '5g', 'main_camera_mp', 'selfie_camera_mp', 'weight',
'days_used', 'normalized_new_price', 'os_Others'],
dtype='object')
# let's make predictions on the test set
y_pred = olsres_16.predict(X_test2)
# let's check the RMSE on the train data
rmse1 = np.sqrt(mean_squared_error(y_train, df_pred["Fitted Values"]))
rmse1
0.23834430408761062
# let's check the RMSE on the test data
rmse2 = np.sqrt(mean_squared_error(y_test, y_pred))
rmse2
0.2460800264351214
# let's check the MAE on the train data
mae1 = mean_absolute_error(y_train, df_pred["Fitted Values"])
mae1
0.18697133134689664
# let's check the MAE on the test data
mae2 = mean_absolute_error(y_test, y_pred)
mae2
0.18882679501742966
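The train and test metrics are close (RMSE 0.238 vs 0.246, MAE 0.187 vs 0.189), which suggests the model is not overfitting. Collecting both metrics for both splits in one table makes such gaps easier to read; the sketch below uses toy arrays, where in the notebook the actual `y_train`/fitted values and `y_test`/`y_pred` would be passed in.

```python
# Sketch: RMSE and MAE for a pair of actual/predicted arrays, returned
# as a dict so train and test can sit side by side in one DataFrame.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_metrics(y_true, y_hat):
    return {
        "RMSE": np.sqrt(mean_squared_error(y_true, y_hat)),
        "MAE": mean_absolute_error(y_true, y_hat),
    }


# toy actual and predicted values standing in for the train split
y_true = np.array([4.0, 4.5, 4.3])
y_hat = np.array([4.1, 4.4, 4.3])
print(pd.DataFrame({"train": regression_metrics(y_true, y_hat)}))
```

A large train-to-test gap in such a table would be the first sign that the model memorized the training data rather than generalizing.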
1) Normalized new price has the largest effect on the used price of the phone. If a company wishes to offer higher-priced used phones, it should seek out phones that were more highly priced when sold new.
2) This data does not offer information on return on investment or profitability. Further research should be done to assess the profitability of reselling used phones.
3) Phones with common operating systems (Android, iOS, and Windows) should be preferred, as the 'Others' category has a negative impact on the normalized used price.
4) 4G- and 5G-capable phones should be preferred, as both features have a positive impact on the normalized used price.
5) A unit increase in the normalized new price will result in a 0.4221 unit increase in the phone's normalized used price, all other variables remaining constant.
6) If a phone supports 4G, its normalized used price will be 0.0936 units higher, all other variables remaining constant.
7) If a phone supports 5G, its normalized used price will be 0.0662 units higher, all other variables remaining constant.
8) If a phone runs an operating system other than Windows, iOS, or Android, its normalized used price will be 0.1767 units lower, all other variables remaining constant.